{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Tiger Tutorial: Solving POMDPs"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This tutorial outlines how to define a POMDP using the [POMDPs.jl](https://github.com/JuliaPOMDP/POMDPs.jl) interface. We will go through a simple problem simply known as the tiger problem (we will refer to it as the tiger POMDP). After defining the tiger POMDP, we will use QMDP and SARSOP to solve the POMDP. If you are new to working with this package, check out the [tutorial](http://nbviewer.ipython.org/github/JuliaPOMDP/POMDPs.jl/blob/master/examples/GridWorld.ipynb) on MDPs first."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Dependencies\n",
    "You need to install a few modules in order to use this notebook. If you have all the modules below installed, great! If not run the following commands:\n",
    "\n",
    "```julia\n",
    "# install the POMDPs.jl interface\n",
    "Pkg.clone(\"https://github.com/sisl/POMDPs.jl.git\")\n",
    "\n",
    "using POMDPs\n",
    "\n",
    "# install the SARSOP wrapper\n",
    "POMDPs.add(\"SARSOP\") # note this downloads and builds the APPL toolit and may take a few minutes \n",
    "\n",
    "# install the QMDP solver\n",
    "POMDPs.add(\"QMDP\")\n",
    "\n",
    "# install a helper modules\n",
    "POMDPs.add(\"POMDPToolbox\") # this provides implementations of discrete belief updating\n",
    "\n",
    "# install a Julia packages for working with distributions\n",
    "Pkg.add(\"Distributions\")\n",
    "```\n",
    "\n",
    "If you already have all of the modules above, make sure you have the most recent versions. Many of these are still under heavy development, so update before starting by running\n",
    "\n",
    "```julia\n",
    "Pkg.update()\n",
    "```"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Problem Overview\n",
    "In the tiger POMDP, the agent is tasked with escaping from a room. There are two doors leading out of the room. Behind one of the doors is a tiger, and behind the other is sweet, sweet freedom. If the agent opens the door and finds the tiger, it gets eaten (and receives a reward of -100). If the agent opens the other door, it escapes and receives a reward of 10. The agent can also listen. Listening gives a noisy measurement of which door the tiger is hiding behind. Listening gives the agent the correct location of the tiger 85% of the time. The agent receives a reward of -1 for listening. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "# first import POMDPs.jl\n",
    "using POMDPs"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## POMDP \n",
    "A POMDP is defined by the tuple\n",
    "$$(\\mathcal{S}, \\mathcal{A}, \\mathcal{Z}, T, R, O).$$\n",
    "In addition to the familiar, state $\\mathcal{S}$ and action $\\mathcal{A}$ spaces, we must also define an observation space $\\mathcal{Z}$ and an observation function $O$. The POMDP problem definition may be similar to the one for MDP. For example, if you wanted to add state uncertainty to your problem, you can define the observation space, and observation function in addition to your previous MDP definition.\n",
    "\n",
    "Before defining the spaces for this problem, let's first define the concrete type for the tiger POMDP. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "type TigerPOMDP <: POMDP{Bool, Symbol, Bool} # POMDP{State, Action, Observation} all parametarized by Int64s\n",
    "    r_listen::Float64 # reward for listening (default -1)\n",
    "    r_findtiger::Float64 # reward for finding the tiger (default -100)\n",
    "    r_escapetiger::Float64 # reward for escaping (default 10)\n",
    "    p_listen_correctly::Float64 # prob of correctly listening (default 0.85)\n",
    "    discount_factor::Float64 # discount\n",
    "end\n",
    "# default constructor\n",
    "function TigerPOMDP()\n",
    "    return TigerPOMDP(-1.0, -100.0, 10.0, 0.85, 0.95)\n",
    "end;"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "There are a number of parameters in the problem definition, but we can treat them all as constants. You can read more about the Tiger problem and POMDPs [here](http://www.techfak.uni-bielefeld.de/~skopp/Lehre/STdKI_SS10/POMDP_tutorial.pdf#page=28). However, we created a default constructor that allows us to initialize the tiger POMDP by simply running:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "TigerPOMDP(-1.0, -100.0, 10.0, 0.85, 0.95)"
      ]
     },
     "execution_count": 3,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "pomdp = TigerPOMDP()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Note that the TigerPOMDP type inherits from a ```POMDP{Bool, Symbol, Bool}```. This means that in our problem we will use ```Bool``` to represent our states, ```Symbol``` to represent our actions and ```Bool``` to represent our observations. More details on states, actions and observations in this problem are below."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## States\n",
    "We define our state with a boolean that indicates weather or not the tiger is hiding behind the left door. If our state is ```true```, the tiger is behind the left door. If its ```false```, the tiger is behind the right door. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "false"
      ]
     },
     "execution_count": 4,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "example_state = false # tiger is hiding behind right door"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Since the state is a binary value, we represent it as a boolean, but we could have represented it as an integer or any other sensible type."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Actions\n",
    "There are three possible actions our agent can take: open the left door, open the right door, and listen. For clarity, we will represent these with symbols."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       ":listen"
      ]
     },
     "execution_count": 5,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "example_action = :listen # agent listens, can be :openl or :openr"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We will represent our actions with the following symbols: open left (:openl), open right (:openr), and listen (:listen). For example, the action below represnts listening:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       ":listen"
      ]
     },
     "execution_count": 6,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "a = :listen"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Observations\n",
    "There are two possible observations: the agent either hears the tiger behind the left door or behind the right door. We use a boolean to represent the observations:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "true"
      ]
     },
     "execution_count": 7,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "example_observation = true # agent heard the tiger behind the left door"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Spaces\n",
    "Let's define our state, action and observation spaces."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### State Space\n",
    "There are only two states in the tiger POMDP: the tiger is either behind the left door or behind the right door. Our state space is simply an array of the states in the problem.  Recall, that the ```states``` function returns the state space for a given POMDP type."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "POMDPs.states(::TigerPOMDP) = [true, false];"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Action Space\n",
    "There are three actions in our problem. Once again, we represent the action space as an array of the actions in our problem. The ```actions``` function serve a similar purpose to the ```states``` function above. Since the action space is discrete, we can define the ```action_index``` function that associates an integer index to each action."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "POMDPs.actions(::TigerPOMDP) = [:openl, :openr, :listen] # default\n",
    "POMDPs.actions(pomdp::TigerPOMDP, state::Bool) = POMDPs.actions(pomdp) # convenience (actions do not change in different states)\n",
    "function POMDPs.action_index(::TigerPOMDP, a::Symbol)\n",
    "    if a==:openl\n",
    "        return 1\n",
    "    elseif a==:openr\n",
    "        return 2\n",
    "    elseif a==:listen\n",
    "        return 3\n",
    "    end\n",
    "    error(\"invalid TigerPOMDP action: $a\")\n",
    "end;"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Observation Space\n",
    "The observation space looks similar to the state space. Recall that the state represents the truth about our system, while the observation is potentially false information recieves about the state. In the tiger POMDP, our observation could give us a false representation of our state. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "# function returning observation space\n",
    "POMDPs.observations(::TigerPOMDP) = [true, false];\n",
    "POMDPs.observations(pomdp::TigerPOMDP, s::Bool) = observations(pomdp);"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now that we've defined the POMDP spaces, let's move on to defining the model functions."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Transition and Observation Distributions\n",
    "Before defining the model functions, we first need to create a distributions data-type. In general, our distributions should support sampling and have a ```pdf``` method. If you only want to get a policy from the SARSOP and QMDP solvers, you do not need to worry about implementing a sampling function. However, if you want to simulate the policy, you should implement these functions.\n",
    "\n",
    "Since the transition and observation distributions have identical form, we could just use a single type to serve the needs of both. This will not be the case in general."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "# distribution type that will be used for both transitions and observations\n",
    "type TigerDistribution\n",
    "    p::Float64\n",
    "    it::Vector{Bool}\n",
    "end\n",
    "TigerDistribution() = TigerDistribution(0.5, [true, false])\n",
    "POMDPs.iterator(d::TigerDistribution) = d.it"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We now define the ```pdf``` function. For a discrete problem, this function returns the probability of a given element (state or observation in our case)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "# transition and observation pdf\n",
    "function POMDPs.pdf(d::TigerDistribution, so::Bool)\n",
    "    so ? (return d.p) : (return 1.0-d.p)\n",
    "end;"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Finally, let's create the sampling functions. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "# samples from transition or observation distribution\n",
    "POMDPs.rand(rng::AbstractRNG, d::TigerDistribution) = rand(rng) <= d.p;"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This is all we have to do for our distribution functions!"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Transition Model\n",
    "Here we define the transition model for the tiger POMDP. The model itself is fairly simple. Our state is represented by the location of the tiger (left or right). The location of the tiger doesn't change when the agent listens. However, after the agent opens the door, it reaches a terminal state. That is the agent either escapes or gets eaten. To simplify our formulation, we simply reset the location of the tiger randomly. We could model this problem with a terminal state (i.e. one in which the agent no longer receives reward) as well. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "# Resets the problem after opening door; does nothing after listening\n",
    "function POMDPs.transition(pomdp::TigerPOMDP, s::Bool, a::Symbol)\n",
    "    d = TigerDistribution()\n",
    "    if a == :openl || a == :openr\n",
    "        d.p = 0.5\n",
    "    elseif s\n",
    "        d.p = 1.0\n",
    "    else\n",
    "        d.p = 0.0\n",
    "    end\n",
    "    d\n",
    "end;"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Reward Model\n",
    "The reward model caputres the immediate objectives of the agent. It recieves a large negative reward for opening the door with the tiger behind it (-100), gets a positive reward for opening the other door (+10), and a small penalty for listening (-1)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "# reward model\n",
    "function POMDPs.reward(pomdp::TigerPOMDP, s::Bool, a::Symbol)\n",
    "    r = 0.0\n",
    "    a == :listen ? (r+=pomdp.r_listen) : (nothing)\n",
    "    if a == :openl\n",
    "        s ? (r += pomdp.r_findtiger) : (r += pomdp.r_escapetiger)\n",
    "    end\n",
    "    if a == :openr\n",
    "        s ? (r += pomdp.r_escapetiger) : (r += pomdp.r_findtiger)\n",
    "    end\n",
    "    return r\n",
    "end\n",
    "POMDPs.reward(pomdp::TigerPOMDP, s::Bool, a::Symbol, sp::Bool) = reward(pomdp, s, a);"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Observation Model\n",
    "The observation model captures the uncertaintiy in the agent's lsitening ability. When we listen, we receive a noisy measurement of the tiger's location. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "# observation model\n",
    "function POMDPs.observation(pomdp::TigerPOMDP, a::Symbol, s::Bool)\n",
    "    d = TigerDistribution()\n",
    "    pc = pomdp.p_listen_correctly\n",
    "    if a == :listen\n",
    "        s ? (d.p = pc) : (d.p = 1.0-pc)\n",
    "    else\n",
    "        d.p = 0.5\n",
    "    end\n",
    "    d\n",
    "end;"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Miscallenous Functions"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let's define the ```discount``` function and the functions that return the size of our spaces."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "POMDPs.discount(pomdp::TigerPOMDP) = pomdp.discount_factor\n",
    "POMDPs.n_states(::TigerPOMDP) = 2\n",
    "POMDPs.n_actions(::TigerPOMDP) = 3\n",
    "POMDPs.n_observations(::TigerPOMDP) = 2;"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Beliefs\n",
    "If you are somewhat familiar with Julia defining all of the above may have been relaitvely simple. However, all POMDPs must be represented with a belief. Implementing beliefs and their updaters can be tricky. Luckily, our solvers abstract away the belief updating. All you need to do is define a function that returns an initial distriubtion over states. This distribution needs to support ```pdf``` and ```rand``` function. We already defined a dsitribution like that, so our job here is simple!"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 18,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "POMDPs.initial_state_distribution(pomdp::TigerPOMDP) = TigerDistribution(0.5, [true, false]);"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "If you are interested in creating your own beliefs and update schemes check out the [POMDPToolbox](https://github.com/JuliaPOMDP/POMDPToolbox.jl) module which implements a number of beliefs and udpate schemes."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## SARSOP Solver\n",
    "Let's play around with the [SARSOP.jl](https://github.com/sisl/SARSOP.jl) solver. The module we provide is a wrapper for the SARSOP backend. You can find more information about it [here](http://bigbird.comp.nus.edu.sg/pmwiki/farm/appl/)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 19,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Generating a pomdpx file: model.pomdpx\n",
      "\n",
      "Loading the model ...\n",
      "  input file   : model.pomdpx\n",
      "  loading time : 0.00s \n",
      "\n",
      "SARSOP initializing ...\n",
      "  initialization time : 0.00s\n",
      "\n",
      "-------------------------------------------------------------------------------\n",
      " Time   |#Trial |#Backup |LBound    |UBound    |Precision  |#Alphas |#Beliefs  \n",
      "-------------------------------------------------------------------------------\n",
      " 0       0       0        -20        92.8206    112.821     3        1        \n",
      " 0       2       51       -6.2981    63.1396    69.4377     7        16       \n",
      " 0       4       103      0.149651   52.2764    52.1268     9        21       \n",
      " 0       6       151      6.19248    42.0546    35.8621     9        21       \n",
      " 0       8       200      10.3563    35.232     24.8757     12       21       \n",
      " 0       11      250      14.0433    29.5471    15.5037     6        21       \n",
      " 0       14      300      16.545     25.0926    8.54759     10       21       \n",
      " 0       17      350      18.2281    21.8163    3.5882      14       21       \n",
      " 0       18      400      18.7451    20.9384    2.19328     8        21       \n",
      " 0       21      465      19.1109    20.0218    0.910956    5        21       \n",
      " 0       22      500      19.2369    19.7071    0.470219    11       21       \n",
      " 0       24      550      19.3036    19.5405    0.236865    6        21       \n",
      " 0       25      600      19.3369    19.4574    0.120445    13       21       \n",
      " 0       27      669      19.3579    19.4049    0.0469305   5        21       \n",
      " 0       28      713      19.3643    19.389     0.024739    5        21       \n",
      " 0       29      757      19.3676    19.3807    0.0130409   5        21       \n",
      " 0       30      801      19.3694    19.3763    0.0068744   5        21       \n",
      " 0       31      850      19.3704    19.3739    0.00351433  10       21       \n",
      " 0       32      900      19.3709    19.3725    0.00155165  5        21       \n",
      " 0       33      936      19.3711    19.3721    0.000976551 8        21       \n",
      "-------------------------------------------------------------------------------\n",
      "\n",
      "SARSOP finishing ...\n",
      "  target precision reached\n",
      "  target precision  : 0.001000\n",
      "  precision reached : 0.000977 \n",
      "\n",
      "-------------------------------------------------------------------------------\n",
      " Time   |#Trial |#Backup |LBound    |UBound    |Precision  |#Alphas |#Beliefs  \n",
      "-------------------------------------------------------------------------------\n",
      " 0       33      936      19.3711    19.3721    0.000976551 5        21       \n",
      "-------------------------------------------------------------------------------\n",
      "\n",
      "Writing out policy ...\n",
      "  output file : out.policy\n",
      "\n"
     ]
    }
   ],
   "source": [
    "using SARSOP # load the module\n",
    "# initialize our tiger POMDP\n",
    "pomdp = TigerPOMDP()\n",
    "\n",
    "# initialize the solver\n",
    "solver = SARSOPSolver()\n",
    "# run the solve function\n",
    "policy = solve(solver, pomdp)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 20,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "2×5 Array{Float64,2}:\n",
       " -81.5975   3.01448  24.6954    28.4025  19.3711\n",
       "  28.4025  24.6954    3.01452  -81.5975  19.3711"
      ]
     },
     "execution_count": 20,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# we can retrieve the alpha vectors by calling\n",
    "alphas(policy)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let's now see how our policy changes with the belief."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 21,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "# use POMDPToolbox for beliefs\n",
    "using POMDPToolbox\n",
    "\n",
    "# let's initialize the beliefs\n",
    "b = DiscreteBelief(2); # the initial prior"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 22,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       ":listen"
      ]
     },
     "execution_count": 22,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "a = action(policy, b) # index of action, you need to convert this to the true action, support for automatic conversion is coming soon"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let's simulate our policy. We'll use POMDPToolbox to do the simulation. As mentioned earlier, in a POMDP, the decision is based on a belief. However, each policy (comes from the solver modules) implements its own belief udpating scheme, so you do not need to worry about deling with beliefs. The only thing you need is to define an ```initial_state_distribution```. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 23,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "s: true, b: [0.5, 0.5], action: listen, obs: true\n",
      "s: true, b: [0.15, 0.85], action: listen, obs: true\n",
      "s: true, b: [0.0302013, 0.969799], action: openr, obs: false\n",
      "s: true, b: [0.5, 0.5], action: listen, obs: true\n",
      "s: true, b: [0.15, 0.85], action: listen, obs: true\n",
      "s: true, b: [0.0302013, 0.969799], action: openr, obs: true\n",
      "s: true, b: [0.5, 0.5], action: listen, obs: true\n",
      "s: true, b: [0.15, 0.85], action: listen, obs: true\n",
      "s: true, b: [0.0302013, 0.969799], action: openr, obs: true\n",
      "s: true, b: [0.5, 0.5], action: listen, obs: true\n",
      "s: true, b: [0.15, 0.85], action: listen, obs: false\n",
      "s: true, b: [0.5, 0.5], action: listen, obs: true\n",
      "s: true, b: [0.15, 0.85], action: listen, obs: true\n",
      "s: true, b: [0.0302013, 0.969799], action: openr, obs: true\n",
      "Total reward: 21.136977555064835\n"
     ]
    }
   ],
   "source": [
    "using POMDPToolbox # for simulation\n",
    "\n",
    "pomdp = TigerPOMDP() # initialize problem\n",
    "init_dist = initial_state_distribution(pomdp) # initialize distriubtion over state\n",
    "\n",
    "up = updater(policy) # belief updater for our policy, SARSOP uses a discrete Bayesian filter\n",
    "hr = HistoryRecorder(max_steps=14, rng=MersenneTwister(1)) # history recorder that keeps track of states, observations and beliefs\n",
    "\n",
    "hist = simulate(hr, pomdp, policy, up, init_dist)\n",
    "\n",
    "for (s, b, a, r, sp, op) in hist\n",
    "    println(\"s: $s, b: $(b.b), action: $a, obs: $op\")\n",
    "end\n",
    "println(\"Total reward: $(discounted_reward(hist))\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Notice that over the first nine time steps, the policy is fairly simple. We listen twice, and then decide which door to open. However, in the following steps, we get a mix of observations, which makes the decision harder. Our agent does not open a door, because its belief is still uniform at the last time step! "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## QMDP Solver\n",
    "We will briefly go over the [QMDP.jl](https://github.com/sisl/QMDP.jl) solver. You should use QMDP with a word of caution. QMDP assumes that all state uncetainty dissapears in the next time step. This could lead to bad policies in problems with information gathering actions. For example, in the tiger POMDP listening is an information gathering action, and the resulting QMDP policy is quite poor. However, QMDP can work very well in problems where the state uncertainity is not impacted by the agent's action (for example systems with noisy sensor measurements)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 24,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[Iteration 1   ] residual:       14.8 | iteration runtime:      0.045 ms, (  4.53E-05 s total)\n",
      "[Iteration 2   ] residual:       12.6 | iteration runtime:      0.013 ms, (  5.78E-05 s total)\n",
      "[Iteration 3   ] residual:       11.6 | iteration runtime:      0.003 ms, (  6.13E-05 s total)\n",
      "[Iteration 4   ] residual:       10.9 | iteration runtime:      0.003 ms, (  6.43E-05 s total)\n",
      "[Iteration 5   ] residual:       10.3 | iteration runtime:      0.003 ms, (  6.7E-05 s total)\n",
      "[Iteration 6   ] residual:       9.59 | iteration runtime:      0.002 ms, (  6.94E-05 s total)\n",
      "[Iteration 7   ] residual:       8.96 | iteration runtime:      0.003 ms, (  7.21E-05 s total)\n",
      "[Iteration 8   ] residual:       8.37 | iteration runtime:      0.002 ms, (  7.45E-05 s total)\n",
      "[Iteration 9   ] residual:       7.82 | iteration runtime:      0.002 ms, (  7.69E-05 s total)\n",
      "[Iteration 10  ] residual:        7.3 | iteration runtime:      0.002 ms, (  7.92E-05 s total)\n",
      "[Iteration 11  ] residual:       6.82 | iteration runtime:      0.003 ms, (  8.18E-05 s total)\n",
      "[Iteration 12  ] residual:       6.37 | iteration runtime:      0.003 ms, (  8.47E-05 s total)\n",
      "[Iteration 13  ] residual:       5.95 | iteration runtime:      0.002 ms, (  8.71E-05 s total)\n",
      "[Iteration 14  ] residual:       5.56 | iteration runtime:      0.002 ms, (  8.95E-05 s total)\n",
      "[Iteration 15  ] residual:       5.19 | iteration runtime:      0.003 ms, (  9.22E-05 s total)\n",
      "[Iteration 16  ] residual:       4.85 | iteration runtime:      0.003 ms, (  9.48E-05 s total)\n",
      "[Iteration 17  ] residual:       4.53 | iteration runtime:      0.002 ms, (  9.72E-05 s total)\n",
      "[Iteration 18  ] residual:       4.23 | iteration runtime:      0.002 ms, (  9.97E-05 s total)\n",
      "[Iteration 19  ] residual:       3.95 | iteration runtime:      0.002 ms, (  0.000102 s total)\n",
      "[Iteration 20  ] residual:       3.69 | iteration runtime:      0.002 ms, (  0.000105 s total)\n",
      "[Iteration 21  ] residual:       3.45 | iteration runtime:      0.003 ms, (  0.000107 s total)\n",
      "[Iteration 22  ] residual:       3.22 | iteration runtime:      0.003 ms, (   0.00011 s total)\n",
      "[Iteration 23  ] residual:       3.01 | iteration runtime:      0.003 ms, (  0.000113 s total)\n",
      "[Iteration 24  ] residual:       2.81 | iteration runtime:      0.003 ms, (  0.000115 s total)\n",
      "[Iteration 25  ] residual:       2.62 | iteration runtime:      0.003 ms, (  0.000118 s total)\n",
      "[Iteration 26  ] residual:       2.45 | iteration runtime:      0.002 ms, (   0.00012 s total)\n",
      "[Iteration 27  ] residual:       2.29 | iteration runtime:      0.003 ms, (  0.000123 s total)\n",
      "[Iteration 28  ] residual:       2.14 | iteration runtime:      0.003 ms, (  0.000126 s total)\n",
      "[Iteration 29  ] residual:          2 | iteration runtime:      0.002 ms, (  0.000128 s total)\n",
      "[Iteration 30  ] residual:       1.87 | iteration runtime:      0.002 ms, (   0.00013 s total)\n",
      "[Iteration 31  ] residual:       1.74 | iteration runtime:      0.003 ms, (  0.000133 s total)\n",
      "[Iteration 32  ] residual:       1.63 | iteration runtime:      0.003 ms, (  0.000136 s total)\n",
      "[Iteration 33  ] residual:       1.52 | iteration runtime:      0.002 ms, (  0.000138 s total)\n",
      "[Iteration 34  ] residual:       1.42 | iteration runtime:      0.003 ms, (  0.000141 s total)\n",
      "[Iteration 35  ] residual:       1.33 | iteration runtime:      0.002 ms, (  0.000143 s total)\n",
      "[Iteration 36  ] residual:       1.24 | iteration runtime:      0.002 ms, (  0.000145 s total)\n",
      "[Iteration 37  ] residual:       1.16 | iteration runtime:      0.002 ms, (  0.000148 s total)\n",
      "[Iteration 38  ] residual:       1.08 | iteration runtime:      0.003 ms, (   0.00015 s total)\n",
      "[Iteration 39  ] residual:       1.01 | iteration runtime:      0.003 ms, (  0.000153 s total)\n",
      "[Iteration 40  ] residual:      0.944 | iteration runtime:      0.002 ms, (  0.000155 s total)\n",
      "[Iteration 41  ] residual:      0.882 | iteration runtime:      0.002 ms, (  0.000158 s total)\n",
      "[Iteration 42  ] residual:      0.823 | iteration runtime:      0.003 ms, (  0.000161 s total)\n",
      "[Iteration 43  ] residual:      0.769 | iteration runtime:      0.003 ms, (  0.000164 s total)\n",
      "[Iteration 44  ] residual:      0.718 | iteration runtime:      0.003 ms, (  0.000167 s total)\n",
      "[Iteration 45  ] residual:      0.671 | iteration runtime:      0.003 ms, (   0.00017 s total)\n",
      "[Iteration 46  ] residual:      0.627 | iteration runtime:      0.002 ms, (  0.000172 s total)\n",
      "[Iteration 47  ] residual:      0.586 | iteration runtime:      0.003 ms, (  0.000175 s total)\n",
      "[Iteration 48  ] residual:      0.547 | iteration runtime:      0.003 ms, (  0.000177 s total)\n",
      "[Iteration 49  ] residual:      0.511 | iteration runtime:      0.002 ms, (   0.00018 s total)\n",
      "[Iteration 50  ] residual:      0.477 | iteration runtime:      0.002 ms, (  0.000182 s total)\n"
     ]
    }
   ],
   "source": [
    "using QMDP\n",
    "\n",
    "# initialize the solver\n",
    "# key-word args are the maximum number of iterations the solver will run for, and the Bellman tolerance\n",
    "solver = QMDPSolver(max_iterations=50, tolerance=1e-3) \n",
    "\n",
    "# run the solver\n",
    "qmdp_policy = solve(solver, pomdp, verbose=true)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 25,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "2×3 Array{Float64,2}:\n",
       " 193.239    83.2389  182.124\n",
       "  83.4656  193.466   182.354"
      ]
     },
     "execution_count": 25,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "qmdp_policy.alphas"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Notice that these alpha-vectors differ from those compute by SARSOP. Let's see how the policy looks in simulation. We'll use the same procedure to simulate the QMDP policy. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 26,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "s: true, b: [0.5, 0.5], action: listen, obs: true\n",
      "s: true, b: [0.15, 0.85], action: listen, obs: true\n",
      "s: true, b: [0.0302013, 0.969799], action: openr, obs: false\n",
      "s: true, b: [0.5, 0.5], action: listen, obs: true\n",
      "s: true, b: [0.15, 0.85], action: listen, obs: true\n",
      "s: true, b: [0.0302013, 0.969799], action: openr, obs: true\n",
      "s: true, b: [0.5, 0.5], action: listen, obs: true\n",
      "s: true, b: [0.15, 0.85], action: listen, obs: true\n",
      "s: true, b: [0.0302013, 0.969799], action: openr, obs: true\n",
      "s: true, b: [0.5, 0.5], action: listen, obs: true\n",
      "s: true, b: [0.15, 0.85], action: listen, obs: false\n",
      "s: true, b: [0.5, 0.5], action: listen, obs: true\n",
      "s: true, b: [0.15, 0.85], action: listen, obs: true\n",
      "s: true, b: [0.0302013, 0.969799], action: openr, obs: true\n",
      "Total reward: 21.136977555064835\n"
     ]
    }
   ],
   "source": [
    "pomdp = TigerPOMDP() # initialize problem\n",
    "init_dist = initial_state_distribution(pomdp) # initialize distriubtion over state\n",
    "\n",
    "up = updater(qmdp_policy) # belief updater for our policy\n",
    "hist = HistoryRecorder(max_steps=14, rng=MersenneTwister(1)) # history recorder that keeps track of states, observations and beliefs\n",
    "\n",
    "hist = simulate(hist, pomdp, qmdp_policy, up, init_dist)\n",
    "\n",
    "for (s, b, a, r, sp, op) in hist\n",
    "    println(\"s: $s, b: $(b.b), action: $a, obs: $op\")\n",
    "end\n",
    "println(\"Total reward: $(discounted_reward(hist))\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The results are identical for this single simulation for this particular problem. In general, if you are dealing with a complex problem, it is good to compare the SARSOP and QMDP policies. This framework makes comparing the two policies very simple once you have defined the problem! "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "anaconda-cloud": {},
  "kernelspec": {
   "display_name": "Julia 0.6.1",
   "language": "julia",
   "name": "julia-0.6"
  },
  "language_info": {
   "file_extension": ".jl",
   "mimetype": "application/julia",
   "name": "julia",
   "version": "0.6.1"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 0
}
